GH-3574: parquet-hadoop: Statistics.toParquetStatistics: always set null_count#3575

Open
mdibaiee wants to merge 1 commit into apache:master from mdibaiee:metadata-null-counts-truncated

@mdibaiee

Rationale for this change

Missing null_count statistics for columns in parquet files can cause issues for downstream consumers of these files. There is no need to omit this statistic for columns whose values exceed the configured truncation limit: the null count does not depend on the size of the values, so it remains exact even when min/max cannot be written. It is still reasonable to omit min/max statistics in that case, for the reasons explained in the existing code comment.

What changes are included in this PR?

Always add null_count statistics for columns in parquet files, regardless of their size.
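
A minimal sketch of the intended behavior, for illustration only. The types `ColumnStats` and `WrittenStats`, the method `convert`, and the size check are hypothetical stand-ins, not the actual parquet-java classes or the exact limit logic in `toParquetStatistics`:

```java
// Hypothetical, simplified model of the change; NOT the parquet-java API.
final class NullCountSketch {

    /** In-memory statistics collected for a column (illustrative stand-in). */
    static final class ColumnStats {
        final byte[] min;
        final byte[] max;
        final long nullCount;

        ColumnStats(byte[] min, byte[] max, long nullCount) {
            this.min = min;
            this.max = max;
            this.nullCount = nullCount;
        }
    }

    /** Statistics as they would be written; a null field means "omitted". */
    static final class WrittenStats {
        byte[] min;
        byte[] max;
        Long nullCount;
    }

    /**
     * Always writes nullCount; writes min/max only when they fit within
     * maxStatBytes. The combined-length check is an assumption made for
     * this sketch.
     */
    static WrittenStats convert(ColumnStats stats, int maxStatBytes) {
        WrittenStats out = new WrittenStats();

        // The null count is a plain counter, so it is exact no matter how
        // large the column values are.
        out.nullCount = stats.nullCount;

        // min/max are still omitted (not truncated) when they are too large.
        boolean hasValues = stats.min != null && stats.max != null;
        if (hasValues && stats.min.length + stats.max.length <= maxStatBytes) {
            out.min = stats.min;
            out.max = stats.max;
        }
        return out;
    }

    public static void main(String[] args) {
        ColumnStats large = new ColumnStats(new byte[100], new byte[100], 7L);
        WrittenStats written = convert(large, 64);
        System.out.println("nullCount set: " + (written.nullCount != null));
        System.out.println("min/max set:   " + (written.min != null));
    }
}
```

With this shape, a consumer can rely on null_count being present even for columns whose min/max were dropped.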

Are these changes tested?

Yes, TestParquetMetadataConverter.java has been updated to reflect these changes.

Are there any user-facing changes?

I think the null_count statistic now appearing for columns where it was previously omitted can be considered a user-facing change.

Closes #3574

@mdibaiee
Author

mdibaiee commented Jun 2, 2026

@wgtmac hey, thanks for the initial review. any chance of getting the workflows approved for running so the checks also run and we are closer to merging?

If any help from our side can help expedite this please let us know, as we have customers blocked on this and are happy to help as much as possible. Appreciate your time 🙇🏽

Development

Successfully merging this pull request may close these issues.

null_count is omitted for large columns in parquet files